1. Introduction

Turbofan engines are gas turbine engines commonly used in aircraft propulsion. In this document, we analyze data from such engines. The aim is to model and predict the Remaining Useful Life (RUL) as accurately as possible. In a real-life setup, such a model can be used for:

  • Root-cause analysis, analyzing historical data and identifying patterns or behaviors that lead to failures
  • Real-time monitoring, generating alerts before the actual failure occurs

1.1 Problem definition

The diagram below summarizes the structure of a turbofan engine:

Diagram of engine

The dataset consists of multiple multivariate time series. Each time series comes from a different engine, i.e., the data can be considered to be from a fleet of engines of the same type. Each engine starts with a different degree of initial wear and manufacturing variation, which is unknown to the user. This wear and variation are considered normal, i.e., they are not considered a fault condition. Three operational settings have a substantial effect on engine performance; these settings are also included in the data. The data is contaminated with sensor noise.

The engine is operating normally at the start of each time series, and develops a fault at some point during the series: the fault grows in magnitude until system failure.

1.2 Modeling Process and Approach

The general modeling outline is described in the diagram below and is as follows:

  1. Load data. In this case, the data is stored in text files. Generally speaking, data can be fetched from any source, such as xlsx files or a database
  2. Pre-process data, also called data cleaning. This step is crucial for understanding and trusting the data. One would often iterate between:
    • Transformation: filtering, standardization, filling missing values, engineering new features
    • Visualization: graphical summaries of the data help to identify possible issues
  3. Model training and selection. Here we search over different modeling approaches and select the optimal one based on relevant metrics. Techniques such as cross-validation and grid search for hyperparameter optimization are very useful.
  4. Using the best modeling approach, we perform:
    • Root-cause analysis based on historical data
    • Predictions using new online data

Workflow data modeling

In this notebook we model the engine failures from two different points of view, trying to answer the following questions:

How many cycles are left until the engine fails?

In this case we deal with regression models.

Will the engine fail in less than 10 cycles? What is the probability?

This case falls into the category of classification modeling.

1.3 Library import and environment setup

We use existing data science libraries that provide reliability and speed, combined with a custom-made library that focuses on the specific problems of predictive maintenance.

2. Data Exploration and Cleaning

2.1 Data Loading and Processing

Let's download the data to a local folder and unzip it.
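As a sketch of the loading step, the whitespace-delimited CMAPSS text files can be read with pandas. The `load_cmapss` helper name and the exact column order are assumptions inferred from the table shown below, not the notebook's actual library:

```python
import pandas as pd

# Assumed column order: engine id, cycle, 3 operational settings, 21 sensors.
COLUMNS = [
    "id", "cycle", "op1", "op2", "op3",
    "FanInletTemp", "LPCOutletTemp", "HPCOutletTemp", "LPTOutletTemp",
    "FanInletPres", "BypassDuctPres", "TotalHPCOutletPres", "PhysFanSpeed",
    "PhysCoreSpeed", "EnginePresRatio", "StaticHPCOutletPres", "FuelFlowRatio",
    "CorrFanSpeed", "CorrCoreSpeed", "BypassRatio", "BurnerFuelAirRatio",
    "BleedEnthalpy", "DemandFanSpeed", "DemandCorrFanSpeed",
    "HPTCoolantBleed", "LPTCoolantBleed",
]

def load_cmapss(path):
    """Read a whitespace-delimited CMAPSS text file into a DataFrame."""
    return pd.read_csv(path, sep=r"\s+", header=None, names=COLUMNS)
```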

Raw data has been downloaded
id cycle op1 op2 op3 FanInletTemp LPCOutletTemp HPCOutletTemp LPTOutletTemp FanInletPres BypassDuctPres TotalHPCOutletPres PhysFanSpeed PhysCoreSpeed EnginePresRatio StaticHPCOutletPres FuelFlowRatio CorrFanSpeed CorrCoreSpeed BypassRatio BurnerFuelAirRatio BleedEnthalpy DemandFanSpeed DemandCorrFanSpeed HPTCoolantBleed LPTCoolantBleed
0 1 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 21.61 554.36 2388.06 9046.19 1.3 47.47 521.66 2388.02 8138.62 8.4195 0.03 392 2388 100.0 39.06 23.4190
1 1 2 0.0019 -0.0003 100.0 518.67 642.15 1591.82 1403.14 14.62 21.61 553.75 2388.04 9044.07 1.3 47.49 522.28 2388.07 8131.49 8.4318 0.03 392 2388 100.0 39.00 23.4236
2 1 3 -0.0043 0.0003 100.0 518.67 642.35 1587.99 1404.20 14.62 21.61 554.26 2388.08 9052.94 1.3 47.27 522.42 2388.03 8133.23 8.4178 0.03 390 2388 100.0 38.95 23.3442
3 1 4 0.0007 0.0000 100.0 518.67 642.35 1582.79 1401.87 14.62 21.61 554.45 2388.11 9049.48 1.3 47.13 522.86 2388.08 8133.83 8.3682 0.03 392 2388 100.0 38.88 23.3739
4 1 5 -0.0019 -0.0002 100.0 518.67 642.37 1582.85 1406.22 14.62 21.61 554.00 2388.06 9055.15 1.3 47.28 522.19 2388.04 8133.80 8.4294 0.03 393 2388 100.0 38.90 23.4044
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20626 100 196 -0.0004 -0.0003 100.0 518.67 643.49 1597.98 1428.63 14.62 21.61 551.43 2388.19 9065.52 1.3 48.07 519.49 2388.26 8137.60 8.4956 0.03 397 2388 100.0 38.49 22.9735
20627 100 197 -0.0016 -0.0005 100.0 518.67 643.54 1604.50 1433.58 14.62 21.61 550.86 2388.23 9065.11 1.3 48.04 519.68 2388.22 8136.50 8.5139 0.03 395 2388 100.0 38.30 23.1594
20628 100 198 0.0004 0.0000 100.0 518.67 643.42 1602.46 1428.18 14.62 21.61 550.94 2388.24 9065.90 1.3 48.09 520.01 2388.24 8141.05 8.5646 0.03 398 2388 100.0 38.44 22.9333
20629 100 199 -0.0011 0.0003 100.0 518.67 643.23 1605.26 1426.53 14.62 21.61 550.68 2388.25 9073.72 1.3 48.39 519.67 2388.23 8139.29 8.5389 0.03 395 2388 100.0 38.29 23.0640
20630 100 200 -0.0032 -0.0005 100.0 518.67 643.85 1600.38 1432.14 14.62 21.61 550.79 2388.26 9061.48 1.3 48.20 519.30 2388.26 8137.33 8.5036 0.03 396 2388 100.0 38.37 23.0522

20631 rows × 26 columns

2.2 Data cleaning and visualization

Now that the data is loaded, we do the following pre-processing:

  • Remove correlated sensors. Some sensors have a correlation of almost 1 or -1. This means they may have different scales, but their behavior is the same. Hence, we remove all but one of the correlated sensors.
  • Remove constant sensors. Some sensors, like FanInletTemp, are constant over time, i.e., the sensor signal might have gotten stuck. Such a sensor holds no relevant information, hence it should be removed from the dataset.

After the data cleaning, we visualize the distribution of the remaining features to make sure nothing stands out (outliers and unexpected behaviors).

Removing ['FanInletPres', 'EnginePresRatio', 'BurnerFuelAirRatio'] from columns

On the left plot, we observe that some features are highly correlated. On the right plot, we confirm they have been removed.

Next, let's plot the distribution of the sensor data.


It seems like some values are constant. We could have guessed this by looking at the raw table. We clean those constant columns too.
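Dropping stuck sensors amounts to removing columns with a single unique value; a sketch (the helper name is ours):

```python
import pandas as pd

def drop_constant(df):
    """Remove columns whose value never changes (stuck sensors carry no signal)."""
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant), constant
```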

Removing constant columns: ['op3', 'FanInletTemp', 'BypassDuctPres', 'PhysFanSpeed', 'CorrFanSpeed', 'DemandFanSpeed', 'DemandCorrFanSpeed']
id cycle op1 op2 LPCOutletTemp HPCOutletTemp LPTOutletTemp TotalHPCOutletPres PhysCoreSpeed StaticHPCOutletPres FuelFlowRatio CorrCoreSpeed BypassRatio BleedEnthalpy HPTCoolantBleed LPTCoolantBleed RUL
0 1 1 -0.0007 -0.0004 641.82 1589.70 1400.60 554.36 9046.19 47.47 521.66 8138.62 8.4195 392 39.06 23.4190 191
1 1 2 0.0019 -0.0003 642.15 1591.82 1403.14 553.75 9044.07 47.49 522.28 8131.49 8.4318 392 39.00 23.4236 190
2 1 3 -0.0043 0.0003 642.35 1587.99 1404.20 554.26 9052.94 47.27 522.42 8133.23 8.4178 390 38.95 23.3442 189
3 1 4 0.0007 0.0000 642.35 1582.79 1401.87 554.45 9049.48 47.13 522.86 8133.83 8.3682 392 38.88 23.3739 188
4 1 5 -0.0019 -0.0002 642.37 1582.85 1406.22 554.00 9055.15 47.28 522.19 8133.80 8.4294 393 38.90 23.4044 187
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20626 100 196 -0.0004 -0.0003 643.49 1597.98 1428.63 551.43 9065.52 48.07 519.49 8137.60 8.4956 397 38.49 22.9735 4
20627 100 197 -0.0016 -0.0005 643.54 1604.50 1433.58 550.86 9065.11 48.04 519.68 8136.50 8.5139 395 38.30 23.1594 3
20628 100 198 0.0004 0.0000 643.42 1602.46 1428.18 550.94 9065.90 48.09 520.01 8141.05 8.5646 398 38.44 22.9333 2
20629 100 199 -0.0011 0.0003 643.23 1605.26 1426.53 550.68 9073.72 48.39 519.67 8139.29 8.5389 395 38.29 23.0640 1
20630 100 200 -0.0032 -0.0005 643.85 1600.38 1432.14 550.79 9061.48 48.20 519.30 8137.33 8.5036 396 38.37 23.0522 0

20631 rows × 17 columns

The table above is clean and ready to be used. Let's visualize some of the time series separately. On the X axis, we show the Remaining Useful Life (RUL). As we approach RUL = 0, the engine is about to fail.
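The RUL column in the table above can be derived from the cycle counter; a sketch assuming the usual convention that an engine's last recorded cycle corresponds to RUL = 0:

```python
import pandas as pd

def add_rul(df):
    """RUL = last observed cycle of the engine minus the current cycle."""
    max_cycle = df.groupby("id")["cycle"].transform("max")
    return df.assign(RUL=max_cycle - df["cycle"])
```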

On the graph below, select the variable to plot in the dropdown menu. The variable PhysCoreSpeed seems quite relevant, as its values diverge as the RUL decreases.

3. Remaining Useful Life (RUL) Modeling

In this section we focus on developing a model that answers how many cycles are left until the engine fails. From the machine learning point of view, this means we train a regression model.

3.1 Regression Model: Training and Selection

We compare 3 different models with several combinations of hyperparameters and automatically select the best one:

  1. Linear model with lasso penalization
  2. Gradient boosted regression trees
  3. Neural network: Multi-Layer Perceptron (MLP) regressor.

When selecting a model, we respect the principles of cross-validation to avoid overfitting.
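The search-and-select loop can be sketched with scikit-learn's GridSearchCV. The model families, grids, and dataset below are illustrative placeholders, not the exact configuration used in this notebook:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Placeholder data standing in for the cleaned sensor table.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One (pipeline, grid) pair per model family. Scoring uses negative MAE so
# GridSearchCV's "higher is better" convention still selects the lowest error.
candidates = {
    "linear": (Pipeline([("std", MinMaxScaler()), ("m1", Lasso())]),
               {"m1__alpha": [0.1, 1.0]}),
    "mlp": (Pipeline([("std", MinMaxScaler()),
                      ("m1", MLPRegressor(max_iter=500, random_state=0))]),
            {"m1__hidden_layer_sizes": [(16,), (32, 16)]}),
}

best_name, best_search = None, None
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, cv=5, scoring="neg_mean_absolute_error")
    search.fit(X, y)
    if best_search is None or search.best_score_ > best_search.best_score_:
        best_name, best_search = name, search
```

The winning `best_search.best_estimator_` is then refit on the full training data and evaluated once on a held-out test set.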

Running GridSearchCV for linear.
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed:    1.7s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
Running GridSearchCV for lgb.
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   27.2s
[Parallel(n_jobs=2)]: Done  80 out of  80 | elapsed:   47.0s finished
Running GridSearchCV for mlp.
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  30 out of  30 | elapsed:  2.9min finished
CPU times: user 39.1 s, sys: 6.17 s, total: 45.3 s
Wall time: 3min 55s

Below we show the top 5 models.

Winner model: Pipeline(steps=[('std', MinMaxScaler()),
                ('m1',
                 MLPRegressor(early_stopping=True, hidden_layer_sizes=(64, 32),
                              learning_rate_init=0.01))])
Total number of models: 23
mean_fit_time std_fit_time mean_score_time std_score_time params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score model param_learning_rate param_max_depth param_n_estimators param_m1__hidden_layer_sizes param_m1__learning_rate_init
5 8.670599 0.911532 0.009819 0.001267 {'m1__hidden_layer_sizes': (64, 32), 'm1__lear... -35.869724 -35.682568 -35.185438 -37.024977 -35.555583 -35.863658 0.622316 1 -35.771655 -35.671437 -35.861927 -35.551381 -35.749068 -35.721094 0.104347 mlp NaN NaN NaN (64, 32) 0.01
15 1.440178 0.182225 0.045990 0.002670 {'learning_rate': 0.01, 'max_depth': 20, 'n_es... -36.085727 -36.026132 -35.596279 -37.287655 -35.836283 -36.166415 0.586073 1 -32.142465 -32.193598 -32.205210 -31.863161 -32.120410 -32.104969 0.124915 lgb 0.01 20 500 NaN NaN
13 1.542662 0.067853 0.054697 0.007486 {'learning_rate': 0.01, 'max_depth': 15, 'n_es... -36.106520 -36.029922 -35.579115 -37.291051 -35.844181 -36.170158 0.589187 2 -32.171451 -32.197594 -32.200700 -31.848941 -32.125479 -32.108833 0.132710 lgb 0.01 15 500 NaN NaN
9 1.079573 0.024617 0.044747 0.004913 {'learning_rate': 0.01, 'max_depth': 5, 'n_est... -36.193871 -35.995038 -35.522113 -37.279388 -35.861670 -36.170416 0.596112 3 -33.538418 -33.570205 -33.571423 -33.170582 -33.464165 -33.462958 0.151292 lgb 0.01 5 500 NaN NaN
11 1.693971 0.121202 0.053022 0.008216 {'learning_rate': 0.01, 'max_depth': 10, 'n_es... -36.101462 -36.011962 -35.598674 -37.273967 -35.866133 -36.170440 0.577436 4 -32.248436 -32.253799 -32.335444 -31.960118 -32.210960 -32.201751 0.127463 lgb 0.01 10 500 NaN NaN
Test Mean Absolute Error (MAE): 24.729
Test Root Mean Squared Error (RMSE): 35.182

The "test" error is what we can expect to see on new unseen data.
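The two test metrics above can be computed with scikit-learn; a minimal sketch on made-up values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual vs. predicted RUL values, for illustration only.
y_true = np.array([120.0, 80.0, 30.0, 5.0])
y_pred = np.array([110.0, 95.0, 25.0, 12.0])

mae = mean_absolute_error(y_true, y_pred)           # mean absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
```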

3.2 Model diagnostics (root-cause analysis)

Below we graph the actual RUL versus the predictions. On the left scatter plot, ideally they should be equal (i.e., lie on the diagonal line). Generally speaking, the point cloud looks as expected for a well-fitted model: the spread is proportional to the magnitude of the remaining useful life, and there are no outliers.

On the right plot we show the actual versus predicted RUL for a selection of engines. Note that for two of those engines (39 and 71), the model predicts the RUL quite well. For engine 96, the model underestimates the RUL, while for one of the others it overestimates it.

Next, we analyze the trained model and infer how each of the sensors contributes to the decrease in RUL. Conclusions are stated at the bottom.


From the SHAP plot above we conclude that:

  • Features are sorted by relevance from top to bottom
  • Higher values of cycle (highlighted in red) are associated with lower RUL. This makes a lot of sense: the longer the engine has been running, the closer it is to failure
  • The second most important feature is StaticHPCOutletPres: higher values of pressure indicate lower RUL.
  • Analogous conclusions are drawn for the third and fourth most important variables (PhysCoreSpeed and CorrCoreSpeed)

Similar conclusions are drawn from the Partial Dependence Plots (PDPs) below. PDPs show the expected change in RUL versus each variable taken independently; in other words, they show the marginal effect of each variable or sensor. Note that on the y axis we show the change in RUL.

3.3 Conclusion from Root-Cause Analysis of RUL

  • The number of cycles has a big impact on the RUL. The effect of the cycle stabilizes once the engine reaches 120 cycles.
  • The sensor corresponding to StaticHPCOutletPres is the most important: higher values of pressure indicate lower RUL. Can we limit this pressure in practice? This could be discussed with an expert in turbofan engines.
  • The operational characteristics, which a priori would seem relevant, are not important when predicting the RUL.

4. Failure alerts

In this section, we focus on the case where an engine operator needs a real-time alert system that raises an alarm when the engine is likely to fail in less than 10 cycles. In other words, we answer the following question:

Will the engine fail in less than 10 cycles? With what probability?

We utilise the cleaned dataset discussed above, but change the way we model the target. Instead of modeling the number of cycles, we model the probability that the engine fails within 10 cycles. Because we are dealing with binary outcomes (failure/no failure), we model the target as a binary classification problem.
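Deriving the binary target from the RUL column is a one-liner; a sketch (the helper name and the strict `< 10` convention are assumptions matching the question above):

```python
import pandas as pd

HORIZON = 10  # cycles

def make_failure_label(df, horizon=HORIZON):
    """1 if the engine fails within `horizon` cycles, else 0."""
    return (df["RUL"] < horizon).astype(int)
```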

4.1 Classification model: Training and Selection

The approach we follow is very similar to the one above, except that now we use classification models:

  1. Logistic regression
  2. Boosted classification trees
  3. Neural networks (Multi-Layer Perceptron) with a log-loss function

We train the 3 types of models with different combinations of hyperparameters, respecting the principles of cross-validation to avoid overfitting. The best model is chosen according to the F1 score.

Running GridSearchCV for lgb.
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   24.4s
[Parallel(n_jobs=2)]: Done  80 out of  80 | elapsed:   42.3s finished
Running GridSearchCV for logistic.
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed:    1.8s finished
Running GridSearchCV for mlp.
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  30 out of  30 | elapsed:   42.4s finished
CPU times: user 24.4 s, sys: 3 s, total: 27.4 s
Wall time: 1min 32s

Below we show the top 5 models. Again, tree-based models outperform neural networks and logistic regression models, both in accuracy and in training time.

mean_fit_time std_fit_time mean_score_time std_score_time param_learning_rate param_max_depth param_n_estimators params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score model param_m1__hidden_layer_sizes param_m1__learning_rate_init
13 1.432815 0.043351 0.034637 0.005197 0.01 15 500 {'learning_rate': 0.01, 'max_depth': 15, 'n_es... 0.850440 0.840909 0.811429 0.796610 0.814159 0.822709 0.019908 1 0.992908 0.991489 0.994310 0.994302 0.994294 0.993461 0.001124 lgb NaN NaN
15 1.420245 0.186505 0.033075 0.002451 0.01 20 500 {'learning_rate': 0.01, 'max_depth': 20, 'n_es... 0.847953 0.843305 0.810345 0.797753 0.809384 0.821748 0.020050 2 0.993612 0.990071 0.993612 0.995018 0.996436 0.993750 0.002116 lgb NaN NaN
9 1.214001 0.182729 0.047006 0.006534 0.01 5 500 {'learning_rate': 0.01, 'max_depth': 5, 'n_est... 0.840580 0.840580 0.819209 0.790960 0.816327 0.821531 0.018400 3 0.927247 0.923735 0.934097 0.930000 0.929638 0.928943 0.003411 lgb NaN NaN
11 1.548421 0.066568 0.034493 0.001270 0.01 10 500 {'learning_rate': 0.01, 'max_depth': 10, 'n_es... 0.849558 0.832370 0.819484 0.793201 0.807018 0.820326 0.019561 4 0.987216 0.987234 0.989339 0.986419 0.988588 0.987759 0.001054 lgb NaN NaN
4 2.154439 0.578780 0.011956 0.001952 NaN NaN NaN {'m1__hidden_layer_sizes': (64, 32), 'm1__lear... 0.814159 0.844311 0.822222 0.803371 0.805714 0.817956 0.014764 1 0.814923 0.814040 0.823366 0.823613 0.830189 0.821226 0.006033 mlp (64, 32) 0.005
F1 score test: 0.836

4.2 Model Diagnostics and Interpretation

Below we graph the predicted probability of failure given by the model versus the actual remaining useful life, for one of the engines. The model outputs reasonable probabilities on most occasions, and its confidence grows as the failure time approaches.

Classification models return a probability of failure. If we define a threshold, we can translate this probability into a binary variable: 1 indicates failure, 0 indicates no failure. Below, we select a few thresholds and compare how many failures are predicted correctly (true positives) versus how many false positives we observe. A false positive means we predict a failure that does not actually occur.

There is a trade-off between true and false positives. Depending on the cost of a false positive and the benefit of a true positive, we can establish a threshold to be used in practice.

It is interesting to look at the ratios too, and not just the absolute values. We define the following metrics, which are universal for all classification problems:

  • Precision: the fraction of predicted failures that were actual failures
  • False positive rate: the fraction of non-failures that we incorrectly flag as failures
  • Recall: the fraction of actual failures that were predicted correctly. In other words, amongst all failures, how many do we predict?
  • Threshold: the probability cutoff that translates the probability of failure into a binary variable.
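These metrics follow directly from the confusion counts at a given threshold; a minimal sketch (the `threshold_metrics` helper name is ours, not the notebook's library):

```python
import numpy as np

def threshold_metrics(y_true, proba, threshold):
    """Confusion counts and derived rates for one probability threshold."""
    pred = (proba >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": precision, "recall": recall, "fpr": fpr}
```

Sweeping `threshold` over a grid of values produces a table like the one shown below.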

Looking at the ROC/precision graph below is another way to choose an optimal threshold based on the business needs. The table shows the same data as the graph.

thresholds precision recall fpr TP TN FP FN
100 1.0 NaN 0.000 0.000 0 3904 0 223
90 0.9 0.971 0.457 0.001 102 3901 3 121
80 0.8 0.924 0.650 0.003 145 3892 12 78
70 0.7 0.907 0.740 0.004 165 3887 17 58
60 0.6 0.892 0.780 0.005 174 3883 21 49
50 0.5 0.848 0.825 0.008 184 3871 33 39
40 0.4 0.824 0.861 0.011 192 3863 41 31
30 0.3 0.788 0.901 0.014 201 3850 54 22
20 0.2 0.754 0.946 0.018 211 3835 69 12
10 0.1 0.679 0.960 0.026 214 3803 101 9
0 0.0 0.054 1.000 1.000 223 0 3904 0

If we select a threshold of 0.5, we can expect 0.8% false positives, and more than 80% precision and recall. This seems reasonable from a business perspective. Below we show the confusion matrix for the selected threshold.

Confusion matrix, without normalization

5. Summary and Conclusions

In this notebook we dealt with data from turbofan engines. We pre-processed a dataset and visualized it to spot outliers and unexpected behaviors. Once it was clean and ready to be ingested by machine learning models, we trained two different types of models:

  • A regression model: How many cycles are left until the engine fails?
  • A classification model: Will the engine fail in less than 10 cycles? What is the probability?

During the modeling phase, we tested 3 different types of models with numerous hyperparameter combinations, always respecting the principles of cross validation and selecting the best one objectively based on relevant metrics.

From the regression model, we concluded that the number of cycles since the engine started is the most relevant factor when predicting the remaining useful life. The next most important variable is StaticHPCOutletPres, so further analysis together with turbofan operators should be conducted to understand the physical implication of this variable. We also concluded that the operational characteristics of the engines are not relevant features.

When predicting whether the engine will fail in less than 10 cycles, the model achieves precisions of 80-90% with a false positive rate of 0.1-0.8%. A real-time alert system could be implemented using the developed methodology.

6. Further Work

From the data and modeling point of view, this problem could be improved in several ways:

  • Gather more data.
  • Create new columns, also known as Feature Engineering. For example, we could add a moving-average window to each sensor that filters out noise.
  • Further model architectures could be explored, especially the ones that take advantage of inter-temporal relationships.
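The moving-average idea can be sketched with a per-engine rolling window (the `add_rolling_mean` helper and the `_ma` suffix are illustrative):

```python
import pandas as pd

def add_rolling_mean(df, sensor_cols, window=5):
    """Per-engine moving average that smooths out sensor noise."""
    smoothed = (df.groupby("id")[sensor_cols]
                  .transform(lambda s: s.rolling(window, min_periods=1).mean()))
    return df.join(smoothed.add_suffix("_ma"))
```

Grouping by engine id before rolling is essential: without it, the window would leak values across the boundary between one engine's series and the next.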

On a practical level, it would be interesting to answer further business questions. For example:

  • Will the engine fail in less than 10 cycles, in between 10 and 20, or in more than 20? This could be implemented in practice as a red-yellow-green traffic light.
  • Can we determine the most likely cause of the next failure? We would need to explore SHAP values for the classification problem as well.